AITopics | scaffold split

Collaborating Authors

scaffold split

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

fb60d411a5c5b72b2e7d3527cfc84fd0-AuthorFeedback.pdf

Neural Information Processing SystemsFeb-11-2026, 05:31:41 GMT

contribution, dataset, final version, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

fb60d411a5c5b72b2e7d3527cfc84fd0-AuthorFeedback.pdf

Neural Information Processing SystemsAug-17-2025, 09:34:13 GMT

We thank the reviewers for their time and valuable feedback. We will clearly mention this contribution in our final version. Our design principle is in contrast to the OpenML-CC18 and the TU datasets [43] Regarding the package description, it is provided with example code snippets in Appendices E.1 and E.2, due to space We thank R4 for pointing out the issue with PRC-AUC. Average Precision (AP), and observed the same trend (see Table above). We will update this in the final version.

contribution, dataset, final version, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

4ec360efb3f52643ac43fda570ec0118-Paper-Conference.pdf

Neural Information Processing SystemsAug-14-2025, 18:39:23 GMT

When additional supervised pretraining step is conducted after self-supervised pretraining, we observe statistically significant improvements.

downstream task, graph, objective, (12 more...)

Neural Information Processing Systems

Country: North America > United States > California > Santa Clara County > Stanford (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Molecular Topological Profile (MOLTOP) -- Simple and Strong Baseline for Molecular Graph Classification

Adamczyk, Jakub, Czech, Wojciech

arXiv.org Artificial IntelligenceJul-16-2024

We revisit the effectiveness of topological descriptors for molecular graph classification and design a simple, yet strong baseline. We demonstrate that a simple approach to feature engineering - employing histogram aggregation of edge descriptors and one-hot encoding for atomic numbers and bond types - when combined with a Random Forest classifier, can establish a strong baseline for Graph Neural Networks (GNNs). The novel algorithm, Molecular Topological Profile (MOLTOP), integrates Edge Betweenness Centrality, Adjusted Rand Index and SCAN Structural Similarity score. This approach proves to be remarkably competitive when compared to modern GNNs, while also being simple, fast, low-variance and hyperparameter-free. Our approach is rigorously tested on MoleculeNet datasets using fair evaluation protocol provided by Open Graph Benchmark. We additionally show out-of-domain generation capabilities on peptide classification task from Long Range Graph Benchmark. The evaluations across eleven benchmark datasets reveal MOLTOP's strong discriminative capabilities, surpassing the $1$-WL test and even $3$-WL test for some classes of graphs. Our conclusion is that descriptor-based baselines, such as the one we propose, are still crucial for accurately assessing advancements in the GNN domain.

dataset, graph, moltop, (14 more...)

arXiv.org Artificial Intelligence

2407.12136

Country:

North America > United States (0.46)
Europe > Poland > Lesser Poland Province > Kraków (0.04)
Europe > Switzerland (0.04)
Europe > Czechia > Prague (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.48)
Health & Medicine > Therapeutic Area > Immunology (0.48)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

Scaffold Splits Overestimate Virtual Screening Performance

Guo, Qianrong, Hernandez-Hernandez, Saiveth, Ballester, Pedro J

arXiv.org Artificial IntelligenceJun-30-2024

Virtual Screening (VS) of vast compound libraries guided by Artificial Intelligence (AI) models is a highly productive approach to early drug discovery. Data splitting is crucial for better benchmarking of such AI models. Traditional random data splits produce similar molecules between training and test sets, conflicting with the reality of VS libraries which mostly contain structurally distinct compounds. Scaffold split, grouping molecules by shared core structure, is widely considered to reflect this real-world scenario. However, here we show that the scaffold split also overestimates VS performance. The reason is that molecules with different chemical scaffolds are often similar, which hence introduces unrealistically high similarities between training molecules and test molecules following a scaffold split. Our study examined three representative AI models on 60 NCI-60 datasets, each with approximately 30,000 to 50,000 molecules tested on a different cancer cell line. Each dataset was split with three methods: scaffold, Butina clustering and the more accurate Uniform Manifold Approximation and Projection (UMAP) clustering. Regardless of the model, model performance is much worse with UMAP splits from the results of the 2100 models trained and evaluated for each algorithm and split. These robust results demonstrate the need for more realistic data splits to tune, compare, and select models for VS. For the same reason, avoiding the scaffold split is also recommended for other molecular property prediction problems. The code to reproduce these results is available at https://github.com/ScaffoldSplitsOverestimateVS

cell line, molecule, scaffold split, (14 more...)

arXiv.org Artificial Intelligence

2406.00873

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
North America > Canada (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Ireland (0.04)

Genre: Research Report > New Finding (0.88)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.94)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Unraveling Key Elements Underlying Molecular Property Prediction: A Systematic Study

Deng, Jianyuan, Yang, Zhibo, Wang, Hehe, Ojima, Iwao, Samaras, Dimitris, Wang, Fusheng

arXiv.org Artificial IntelligenceSep-2-2023

Artificial intelligence (AI) has been widely applied in drug discovery with a major task as molecular property prediction. Despite booming techniques in molecular representation learning, key elements underlying molecular property prediction remain largely unexplored, which impedes further advancements in this field. Herein, we conduct an extensive evaluation of representative models using various representations on the MoleculeNet datasets, a suite of opioids-related datasets and two additional activity datasets from the literature. To investigate the predictive power in low-data and high-data space, a series of descriptors datasets of varying sizes are also assembled to evaluate the models. In total, we have trained 62,820 models, including 50,220 models on fixed representations, 4,200 models on SMILES sequences and 8,400 models on molecular graphs. Based on extensive experimentation and rigorous comparison, we show that representation learning models exhibit limited performance in molecular property prediction in most datasets. Besides, multiple key elements underlying molecular property prediction can affect the evaluation results. Furthermore, we show that activity cliffs can significantly impact model prediction. Finally, we explore into potential causes why representation learning models can fail and show that dataset size is essential for representation learning models to excel.

dataset, molecule, representation, (15 more...)

arXiv.org Artificial Intelligence

2209.13492

Country: North America > United States > New York > Suffolk County > Stony Brook (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.94)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)

Add feedback

Does GNN Pretraining Help Molecular Representation?

Sun, Ruoxi, Dai, Hanjun, Yu, Adams Wei

arXiv.org Artificial IntelligenceNov-2-2022

Extracting informative representations of molecules using Graph neural networks (GNNs) is crucial in AI-driven drug discovery. Recently, the graph research community has been trying to replicate the success of self-supervised pretraining in natural language processing, with several successes claimed. However, we find the benefit brought by self-supervised pretraining on small molecular data can be negligible in many cases. We conduct thorough ablation studies on the key components of GNN pretraining, including pretraining objectives, data splitting methods, input features, pretraining dataset scales, and GNN architectures, to see how they affect the accuracy of the downstream tasks. Our first important finding is, self-supervised graph pretraining do not always have statistically significant advantages over non-pretraining methods in many settings. Secondly, although noticeable improvement can be observed with additional supervised pretraining, the improvement may diminish with richer features or more balanced data splits. Thirdly, hyper-parameters could have larger impacts on accuracy of downstream tasks than the choice of pretraining tasks, especially when the scales of downstream tasks are small. Finally, we provide our conjectures where the complexity of some pretraining methods on small molecules might be insufficient, followed by empirical evidences on different pretraining datasets.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2207.0601

Country: North America > United States > California > Santa Clara County > Stanford (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)

Add feedback

Are Learned Molecular Representations Ready For Prime Time?

Yang, Kevin, Swanson, Kyle, Jin, Wengong, Coley, Connor, Eiden, Philipp, Gao, Hua, Guzman-Perez, Angel, Hopper, Timothy, Kelley, Brian, Mathea, Miriam, Palmer, Andrew, Settels, Volker, Jaakkola, Tommi, Jensen, Klavs, Barzilay, Regina

arXiv.org Machine LearningApr-2-2019

Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, we benchmark models extensively on 19 public and 15 proprietary industrial datasets spanning a wide variety of chemical endpoints. In addition, we introduce a graph convolutional model that consistently outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary datasets. Our empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows.

d-mpnn, dataset, roc-auc 0, (14 more...)

arXiv.org Machine Learning

1904.01561

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > Germany (0.04)

Genre: Research Report (1.00)

Industry:

Materials > Chemicals > Commodity Chemicals > Petrochemicals (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area (0.73)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback